TornadoVM is designed to provide a seamless way to offload compute-intensive tasks in Java applications to hardware accelerators such as GPUs and FPGAs, enabling developers to leverage the performance benefits of these devices without having to write low-level code in CUDA, OpenCL, or other hardware-specific parallel programming languages.
Motivation
Heterogeneous hardware architectures, which refer to systems that integrate several types of processing units, such as CPUs, GPUs, TPUs, FPGAs, and ASICs, are designed to optimize performance and energy efficiency for specific workloads, enabling more efficient and specialized processing for various applications, ranging from artificial intelligence (AI) and machine learning (ML) to gaming, scientific research, and autonomous vehicles.
One significant factor contributing to the rise of heterogeneous hardware is the increasing demand for specialized computing capabilities. Traditional general-purpose CPUs, while versatile, may not provide the optimal performance for specialized tasks such as deep learning, cryptography, or image processing. As a result, specialized hardware units, such as GPUs for parallel processing, have gained popularity due to their superior performance in specific tasks.
Figure 1 shows a simplified representation of a computing system that contains a multi-core CPU, a GPU, and an FPGA, together with the programming languages that can be used for each device. To program applications that run on CPUs, developers can use programming languages such as Java, R, C/C++, Python, etc. However, to access heterogeneous hardware, developers must also adapt their existing applications with specialized programming models and frameworks for each architecture. For example, to run on GPUs, developers can use CUDA, Intel oneAPI, or domain-specific libraries such as PyTorch. This creates code fragmentation in the source code of applications, in which diverse programming models must be mixed to use all the hardware available. An analogous situation applies to programming FPGAs.
Fig. 1: Computing system representation
Ideally, we want to avoid code fragmentation while keeping applications for heterogeneous computing systems easy to program and maintain. TornadoVM focuses on solving this. It allows Java developers to offload and accelerate Java programs on hardware accelerators in a transparent manner (Figure 2).
Fig. 2: Offloading and accelerating Java programs
Although TornadoVM is focused on Java, it can be used from other programming languages implemented on top of the Java Virtual Machine (JVM), such as Truffle programming languages (e.g., R, Python, Ruby, JavaScript). TornadoVM abstracts away the complexity of creating and handling GPU/FPGA applications. Instead, it offers programming abstractions that developers can use to program with different GPU/FPGA backends underneath.
TornadoVM, as of the latest version (v0.15), supports three backends: a CUDA/PTX backend for NVIDIA GPUs, an OpenCL backend for discrete GPUs, integrated GPUs, FPGAs and CPUs, and a Level Zero backend for Intel integrated GPUs. Thus, it can run on many types of devices from different vendors, such as NVIDIA, AMD, Intel, Xilinx and ARM (Figure 3).
Fig. 3: TornadoVM can run on many types of devices from different vendors.
TornadoVM
TornadoVM helps developers write code that can be easily offloaded to hardware accelerators, with the goal of increasing performance without requiring a deep understanding of the underlying computing architectures or heterogeneous parallel programming models. Let us now discuss what TornadoVM offers to developers to achieve this goal.
TornadoVM APIs
TornadoVM offers APIs to tackle various levels of abstraction when programming heterogeneous devices. When designing a parallel programming API, we formulate the following questions:
- How do we represent parallelism in a programming language that was not designed for parallelism and heterogeneous programming, such as Java?
- How do we identify the code to be accelerated?
- How do we run the code and explore different optimizations?
The TornadoVM APIs try to address each of these questions with diverse levels of abstraction. The following diagram (Figure 4) shows the constructs within TornadoVM for each of the aforementioned challenges. To represent parallelism (the first question), TornadoVM offers tasks and annotations. To identify the code to be offloaded (the second question), TornadoVM offers a Task-Graph API. Finally, to run the code and explore different optimizations (the third question), TornadoVM exposes an API for defining execution plans.
Note that TornadoVM is in active development and the APIs might change in future versions. This post refers to the latest TornadoVM version available at the time of writing (v0.15). Let us now explore each of these abstractions in more detail.
How do we represent parallelism?
In general, there are two main approaches to parallelizing code: a) via annotations; and b) via explicit constructs in the programming language. Annotations have been explored by multiple programming models, such as OpenMP (to parallelize sequential programs written in Fortran and C/C++ and automatically use multiple CPU cores) and, more recently, OpenACC and OpenMP 4, in which programmers also annotate sequential Fortran and C/C++ code to run on GPUs.
The advantage of this approach is that developers can still provide a sequential implementation of the problem to be solved, and an optimizing compiler could generate parallel code using the annotations that developers provide. This approach, although it is very convenient for programmers, has some drawbacks. It is sometimes hard to automatically parallelize applications, and to compensate for this, such programming models usually offer a considerable number of annotations (see for example OpenMP 4 for GPU programming).
The second approach is to provide explicit parallel abstractions that developers can use to program and access data in parallel, for instance, a thread identifier, abstractions to access different levels of the memory hierarchy, barriers, etc. This approach is more powerful because it offers a higher level of control, but it is also more complex to learn and maintain, because it usually requires knowledge of a new programming model, which is commonly linked to an execution model (how parallel hardware executes code). Additionally, since these new constructs are not part of the original programming language, this approach usually splits the original execution model into different execution models (one for the original programming model, e.g., Java, and another one that maps to the heterogeneous hardware).
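To make the second approach more concrete, here is a minimal sketch in plain Java of what "programming against a thread identifier" looks like. It does not use TornadoVM or any GPU API: the class `ThreadIdSketch`, the `vectorAddKernel` method, and the `launch` helper are hypothetical names, and the launcher merely emulates a one-dimensional grid of threads with a parallel stream.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustration only: emulates the "explicit thread identifier" style on the CPU.
final class ThreadIdSketch {

    // Kernel-style body: written for ONE element, selected by a thread identifier.
    static void vectorAddKernel(int threadId, float[] a, float[] b, float[] c) {
        c[threadId] = a[threadId] + b[threadId];
    }

    // Hypothetical launcher: runs the kernel once per identifier in [0, globalSize).
    static void launch(int globalSize, float[] a, float[] b, float[] c) {
        IntStream.range(0, globalSize)
                 .parallel()
                 .forEach(tid -> vectorAddKernel(tid, a, b, c));
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {10f, 20f, 30f};
        float[] c = new float[a.length];
        launch(a.length, a, b, c);
        System.out.println(Arrays.toString(c)); // [11.0, 22.0, 33.0]
    }
}
```

The point of the sketch is the shape of the code: there is no loop in the kernel body, and the developer is responsible for mapping each identifier to the data it processes.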
TornadoVM offers APIs for these two abstractions
It exposes an API for annotating sequential loops with parallel information (which we call the loop parallel API), and it exposes a lower-level API (which we call the Kernel API) to access low-level information about the hardware. Programmers can choose one of these abstractions to provide parallel code to run on heterogeneous hardware. The loop parallel API is suitable for non-GPU experts and fast prototyping, while the Kernel API is suitable for GPU experts or developers who want to port existing CUDA/OpenCL kernels (functions that run on the GPU) to the Java platform. In either case, the TornadoVM Just-In-Time compiler will optimize and offload the Java code to the corresponding parallel implementation (e.g., OpenCL, SPIR-V or NVIDIA PTX).
Let us now see an example of how a Java application can be expressed in TornadoVM. For simplification, I will focus on the loop parallel API. I will follow the examples explained during the JAX’22 presentation at Mainz. All examples are public and available on GitHub:
https://github.com/jjfumero/tornadovm-examples
The example I am going to use is a blur-filter application to introduce a blur-effect in a JPEG image. This is a quite common filter for computational photography, for example, to blur the background and achieve better subject separation.
On GitHub, you can find details about how to build TornadoVM. Keep in mind that TornadoVM is almost fully implemented in Java, except for a JNI library to call the corresponding GPU/FPGA drivers (e.g., via OpenCL, CUDA, or Intel Level Zero). Thus, for the Java Virtual Machine (JVM), a TornadoVM application is just another Java program with the peculiarity that the TornadoVM runtime system will optimize and offload, on the fly, Java methods to the target GPU code.
The following code snippet shows a fragment of the blur-filter sequential implementation. Note that we do not show the details of the filter to simplify the explanation. However, you can visit the GitHub page for all the details about this computational photography filter:
void channelConvolutionSequential(int[] channel, int[] channelBlurred, final int numRows, final int numCols, float[] filter, final int filterWidth) {
    for (int r = 0; r < numRows; r++) {
        for (int c = 0; c < numCols; c++) {
            float result = 0.0f;
            // compute result based on a tile of size filterWidth:
            …
            // Store the resulting pixel
            channelBlurred[r * numCols + c] = result > 255 ? 255 : (int) result;
        }
    }
}
The input is represented by the first parameter called channel, and the output is represented by the second parameter (channelBlurred). We process this filter per Red-Green-Blue (RGB) channel. Thus, we will have three invocations of this method to process the red, green, and blue pixels of the input image. The final image is obtained by merging all the channels.
As we can see, the code is a simple pair of nested loops. Each loop traverses one dimension of the image (the x-axis and y-axis of the image). The good thing about this type of application is that each pixel can be computed independently of any other pixel in the image. Applications of this type are called embarrassingly parallel. Thus, potentially, if we have enough hardware resources available, we could compute each pixel of the image in a dedicated hardware thread.
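Because the pixels are independent, the rows can already be distributed across CPU threads with standard Java, before any accelerator is involved. The sketch below illustrates this with a parallel stream over the rows. Since the filter body is elided above, `blurPixel` is a hypothetical stand-in (a weighted sum over a square window with coordinates clamped at the borders, assuming an odd `filterWidth`), and the class and method names are illustrative rather than taken from the example repository.

```java
import java.util.stream.IntStream;

// Illustration only: row-level parallelism on the CPU using plain Java streams.
final class RowParallelBlur {

    // Hypothetical per-pixel kernel: weighted sum over a filterWidth x filterWidth
    // window, clamping coordinates at the image borders. Assumes filterWidth is odd.
    static int blurPixel(int[] channel, int numRows, int numCols,
                         float[] filter, int filterWidth, int r, int c) {
        float result = 0.0f;
        int half = filterWidth / 2;
        for (int fr = -half; fr <= half; fr++) {
            for (int fc = -half; fc <= half; fc++) {
                int rr = Math.min(Math.max(r + fr, 0), numRows - 1);
                int cc = Math.min(Math.max(c + fc, 0), numCols - 1);
                float weight = filter[(fr + half) * filterWidth + (fc + half)];
                result += channel[rr * numCols + cc] * weight;
            }
        }
        return result > 255 ? 255 : (int) result;
    }

    // Reference version: visits the pixels one after another.
    static void blurSequential(int[] channel, int[] out, int numRows, int numCols,
                               float[] filter, int filterWidth) {
        for (int r = 0; r < numRows; r++) {
            for (int c = 0; c < numCols; c++) {
                out[r * numCols + c] = blurPixel(channel, numRows, numCols, filter, filterWidth, r, c);
            }
        }
    }

    // Parallel version: each row is an independent unit of work. No two rows write
    // to the same output element, so no synchronization is needed.
    static void blurParallelRows(int[] channel, int[] out, int numRows, int numCols,
                                 float[] filter, int filterWidth) {
        IntStream.range(0, numRows).parallel().forEach(r -> {
            for (int c = 0; c < numCols; c++) {
                out[r * numCols + c] = blurPixel(channel, numRows, numCols, filter, filterWidth, r, c);
            }
        });
    }

    public static void main(String[] args) {
        int rows = 4, cols = 5;
        int[] image = new int[rows * cols];
        for (int i = 0; i < image.length; i++) image[i] = (i * 37) % 256;
        float[] filter = new float[9];
        java.util.Arrays.fill(filter, 1.0f / 9.0f);
        int[] seq = new int[image.length];
        int[] par = new int[image.length];
        blurSequential(image, seq, rows, cols, filter, 3);
        blurParallelRows(image, par, rows, cols, filter, 3);
        System.out.println(java.util.Arrays.equals(seq, par)); // true
    }
}
```

Both versions produce identical output because each output pixel is computed from the read-only input alone, which is exactly the property that makes the loops safe to annotate.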
If we use the loop parallel API, we annotate the loops that are parallelizable. With TornadoVM, developers can parallelize up to three nested loops, which generates a 3D kernel in OpenCL, NVIDIA PTX and SPIR-V. In this example, we can parallelize the two outermost loops, providing a 2D parallel kernel as follows:
void channelConvolutionSequential(int[] channel, int[] channelBlurred, final int numRows, final int numCols, float[] filter, final int filterWidth) {
    for (@Parallel int r = 0; r < numRows; r++) {
        for (@Parallel int c = 0; c < numCols; c++) {
            float result = 0.0f;
            // compute result based on a tile of size filterWidth:
            …
            // Store the resulting pixel
            channelBlurred[r * numCols + c] = result > 255 ? 255 : (int) result;
        }
    }
}
Using the loop parallel API is easy: developers just add the @Parallel annotation at the loop level. However, developers must also reason about where to put the annotations. We can add the @Parallel annotation to a loop only when there are no data dependencies between its iterations, that is, when no iteration reads or writes elements produced by another iteration. Otherwise, the TornadoVM JIT compiler will fail to generate parallel code. This is similar to other parallel programming models such as OpenMP or OpenACC. This might be seen as a limiting factor, but TornadoVM has been used to express many types of computations such as map, reduce, convolutions, stencils, etc. Note that TornadoVM also offers an annotation for performing parallel reductions called @Reduce.
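The difference between a loop that can be annotated and one that cannot is easy to see in plain Java. The sketch below (class and method names are hypothetical, and no TornadoVM API is involved) runs two loops in a caller-supplied iteration order. A loop without dependencies gives the same answer in any order, which is a rough proxy for "safe to run in parallel", while a loop with a loop-carried dependency does not.

```java
import java.util.Arrays;

// Illustration only: why loop-carried dependencies prevent parallelization.
final class LoopDependencies {

    // Independent iterations: out[i] depends only on in[i].
    static void scale(int[] in, int[] out, int[] order) {
        for (int i : order) {
            out[i] = 2 * in[i];
        }
    }

    // Loop-carried dependency: iteration i reads the value written by iteration i - 1.
    static void prefixSum(int[] data, int[] order) {
        for (int i : order) {
            if (i > 0) {
                data[i] += data[i - 1];
            }
        }
    }

    public static void main(String[] args) {
        int[] forward = {0, 1, 2, 3};
        int[] reversed = {3, 2, 1, 0};

        int[] a = new int[4];
        int[] b = new int[4];
        scale(new int[] {1, 2, 3, 4}, a, forward);
        scale(new int[] {1, 2, 3, 4}, b, reversed);
        System.out.println(Arrays.equals(a, b)); // true: order does not matter

        int[] p = {1, 2, 3, 4};
        int[] q = {1, 2, 3, 4};
        prefixSum(p, forward);
        prefixSum(q, reversed);
        System.out.println(Arrays.toString(p)); // [1, 3, 6, 10]
        System.out.println(Arrays.toString(q)); // [1, 3, 5, 7]
    }
}
```

The blur filter belongs to the first category: every output pixel is written exactly once and is computed only from the input channel.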
How do we identify the methods to be offloaded and accelerated?
So far, what we have achieved is to parallelize a Java method using the TornadoVM loop API with the @Parallel annotation. Thus, parallelization is defined within Java methods. But how does TornadoVM know which methods to select?
To do so, TornadoVM exposes a Task-Graph API in which developers can express which tasks (Java methods) are going to be accelerated. Methods can come from different modules and different Java classes. Additionally, developers need to specify the data involved to perform the computation. Thus, task-graphs identify Java methods (tasks) to be accelerated, and data to be offloaded. Continuing our blur-filter example, we can build a Task-Graph as follows:
TaskGraph parallelFilter = new TaskGraph("blur") // the graph name is arbitrary
    .transferToDevice(DataTransferMode.FIRST_EXECUTION, red, green, blue, filter)
    .task("red", BlurFilter::compute, red, redFilter, w, h, filter, FILTER_WIDTH)
    .task("green", BlurFilter::compute, green, greenFilter, w, h, filter, FILTER_WIDTH)
    .task("blue", BlurFilter::compute, blue, blueFilter, w, h, filter, FILTER_WIDTH)
    .transferToHost(DataTransferMode.EVERY_EXECUTION, redFilter, greenFilter, blueFilter);
Let’s go step by step. First, we build an object of type TaskGraph. In the constructor, we give a name to the whole computation. It could be any name. Developers can use these names to perform some optimizations in a later stage (e.g., change the GPU).
The second line specifies the data that needs to be copied (or transferred) from the host (where it resides on the Java heap) to the device (e.g., a GPU). This is done because, frequently, discrete GPUs do not share memory with the main CPU. Thus, new buffers must be allocated, and data must be transferred before running the GPU/FPGA kernels. TornadoVM has different modes to copy data. For example, in this case, we specify that the red, green, blue, and filter arrays must be copied only the first time the whole task graph is executed. This is useful if we have read-only data and the task graph is executed multiple times. Thus, there is no need to perform a copy from the CPU to the GPU every time the task graph is executed.
Then, we define a set of three tasks. Each task corresponds to an invocation of a Java method. In this example, we invoke the same method multiple times to apply the blur-filter effect to each RGB channel of the input image. Thus, there is one task for the red channel, one for the green channel and one for the blue channel. The parameters for each task are as follows: the first parameter is a string that represents the name we want to give to the task. Each task must have a different name. Otherwise, the TornadoVM runtime will throw an exception. The second parameter is a reference (or a Java lambda expression) to the method to be accelerated. The rest of the parameters are the arguments for the method call we are performing (as if it were a normal Java method call).
Finally, there is a call to transfer the final results to the host. As with the data transfers from the host to the device that we saw in line 2, we can also specify whether the output data must be copied every time. In this case, we specify that the copies must be performed every time. However, TornadoVM has a mode called USER_DEFINED in which data is copied on demand, using the execution plan that we will explain in the next section.
Note that a Task-Graph does not offload or execute any code. A Task-Graph in TornadoVM is a data structure that allows the definition of the execution and the data involved, but it does not execute anything. To execute, we must build an execution plan.
How do we run the code and explore different optimizations?
To execute a Task-Graph we must close the definition and build an execution plan. To close a task-graph definition, we invoke the snapshot method of a task graph. This method creates an immutable task graph. Thus, the resulting object cannot be used to append new tasks or add new data.
ImmutableTaskGraph itg = parallelFilter.snapshot();
We can still modify the original task graph object, but this does not have any effect on the immutable task graph. This is by design, because we do not want to add/remove data or code while the GPU runs the application.
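The two design ideas described so far, a graph that only records work and a snapshot that freezes it, can be sketched in a few lines of plain Java. The class below is a hypothetical stand-in written for illustration; it is not TornadoVM's `TaskGraph` or `ImmutableTaskGraph`, and its tasks are ordinary `Runnable` objects executed on the CPU.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a mutable recorder of tasks plus an immutable snapshot of it.
final class DeferredGraphSketch {

    private final List<Runnable> tasks = new ArrayList<>();

    // Records a task; nothing is executed here.
    DeferredGraphSketch task(Runnable body) {
        tasks.add(body);
        return this;
    }

    // Returns an immutable copy; later changes to this graph do not affect it.
    List<Runnable> snapshot() {
        return List.copyOf(tasks);
    }

    // Runs the tasks of a frozen graph in the order in which they were recorded.
    static void execute(List<Runnable> frozen) {
        frozen.forEach(Runnable::run);
    }

    public static void main(String[] args) {
        int[] runs = {0};
        DeferredGraphSketch graph = new DeferredGraphSketch().task(() -> runs[0]++);
        List<Runnable> frozen = graph.snapshot();
        graph.task(() -> runs[0] += 100);  // added after the snapshot, so not part of it
        System.out.println(runs[0]);       // 0: defining the graph ran nothing
        execute(frozen);
        System.out.println(runs[0]);       // 1: only the snapshotted task ran
    }
}
```

Separating the mutable definition from the frozen copy means that whatever executes the snapshot never observes a half-edited graph.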
Finally, we can create an instance of an execution plan and launch our application.
TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(itg);
executionPlan.execute();
The execute method is a blocking call, and at this point, TornadoVM will compile all tasks that belong to the same task graph, optimize them and generate code for the selected backend. If developers do not select a backend (e.g., OpenCL), TornadoVM will run on a default one. The default depends on the installation and the computing system that TornadoVM runs on. For example, on a Linux PC with an NVIDIA GPU, the default backend (as of TornadoVM v0.15) is OpenCL.
But wait, you said an execution plan can also be used for defining optimizations, didn’t you?
Yes, through an execution plan, developers can change the GPU/FPGA to use, can enable the profiler, can tune the GPU threads to deploy, etc.
For example, the following code snippet enables the profiler, performs a warmup phase (compiles the code before the run) and selects a specific device:
executionPlan.withProfiler(ProfilerMode.SILENT)
    .withWarmUp()
    .withDevice(nvidiaRTXGPU);
To materialize this action, we call the execute method again:
executionPlan.execute();
For more information about all actions of an execution plan, you can visit the documentation webpage.
Performance
We have created an application to process a blur filter in a JPEG image. Let us now run it and see the performance we get. I am using an input JPEG image of 5K x 4K pixels, using Java JDK 17 and the TornadoVM v0.15 build. My computer setup is a Dell XPS laptop that contains an i9-10885H Intel CPU, an Intel 630 HD Graphics, and an NVIDIA RTX 2060 GPU.
The following plot (Figure 6) shows a performance comparison between different Java implementations (single-threaded and multi-threaded) and TornadoVM running on different accelerators. The y-axis shows the execution time in milliseconds, while the x-axis shows the different implementations and devices. We run the application for 100 iterations, and the reported time is that of the last iteration.
The first bar shows the execution time of the single-threaded Java application, which takes around 135 ms. The Java multi-threaded version, which was implemented using the Java stream API and executed with 16 CPU cores, takes 10 ms per iteration. TornadoVM can also run on CPUs using the OpenCL backend. This results in 6 ms to process the blur filter, which is roughly 1.7x faster than the Java multi-threaded version. This is usually due to the use of vector processing units on the CPU.
When running on GPUs, TornadoVM achieves a speedup of up to 437x (0.3 ms) on the NVIDIA RTX 2060. Note that TornadoVM can run on the same GPU using different backends. For example, running on the Intel integrated graphics with the Level Zero and SPIR-V backend offers higher performance than the OpenCL backend of TornadoVM (48x compared to 27x).
If we instead take the Java multi-threaded implementation as the baseline, which is the maximum performance we can get for this application without using any external libraries, TornadoVM achieves speedups of up to 30x when running on an NVIDIA RTX 2060 (Figure 7).
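As a quick sanity check, the speedups can be recomputed from the timings quoted in the text. The snippet below is only arithmetic on those rounded figures (the class and method names are illustrative): 10 ms against 6 ms gives about 1.67x, 10 ms against 0.3 ms gives about 33x, and a 437x speedup at 0.3 ms implies a baseline of about 131 ms, in line with the "around 135 ms" reported for the single-threaded version.

```java
import java.util.Locale;

// Illustration only: recomputes speedups from the rounded timings quoted in the text.
final class Speedups {

    // Speedup of an accelerated run over a baseline, both measured in milliseconds.
    static double speedup(double baselineMs, double acceleratedMs) {
        return baselineMs / acceleratedMs;
    }

    // Baseline time implied by a reported speedup and an accelerated time.
    static double impliedBaselineMs(double reportedSpeedup, double acceleratedMs) {
        return reportedSpeedup * acceleratedMs;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
                "OpenCL on CPU vs Java multi-threaded: %.2fx", speedup(10.0, 6.0)));
        System.out.println(String.format(Locale.ROOT,
                "RTX 2060 vs Java multi-threaded: %.1fx", speedup(10.0, 0.3)));
        System.out.println(String.format(Locale.ROOT,
                "Baseline implied by 437x at 0.3 ms: %.1f ms", impliedBaselineMs(437.0, 0.3)));
    }
}
```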
Key Features of TornadoVM
Apart from offering APIs for parallel programming and a Just-In-Time Compiler, TornadoVM also offers a set of features that makes this framework unique. This section briefly discusses some of these features.
Automatic Task Migration
TornadoVM can migrate, at runtime, the execution from one device to another without restarting the application. This is called dynamic reconfiguration and it can be enabled using the execution plan. The reason for this mode is that there is no single device that runs all types of workloads efficiently, and many factors, such as the input size and the number of threads, can affect the performance of applications on parallel hardware. TornadoVM will evaluate which device offers the best performance and migrate execution at runtime. More information and examples are available on GitHub.
Batch Processing for Big Data Workloads
Some workloads, especially data analytics applications, require large amounts of data to be processed. Heterogeneous devices such as GPUs usually have less memory than the host system. Thus, TornadoVM offers an option for performing batch processing. With this mode, developers can choose a batch size (e.g., 1GB), and the TornadoVM runtime will automatically handle the execution and compilation for different batches.
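The idea behind batch processing can be sketched in plain Java as follows. This is not how TornadoVM implements the feature; it is a hypothetical CPU-only illustration in which a small temporary buffer plays the role of limited device memory, the batch size is given in elements rather than bytes, and doubling each element stands in for the real kernel.

```java
import java.util.Arrays;

// Illustration only: processing a large array in fixed-size batches.
final class BatchSketch {

    // Number of batches needed to cover `length` elements (ceiling division).
    static int numberOfBatches(int length, int batchSize) {
        return (length + batchSize - 1) / batchSize;
    }

    // Processes `input` in chunks of at most `batchSize` elements.
    static float[] processInBatches(float[] input, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        float[] output = new float[input.length];
        for (int start = 0; start < input.length; start += batchSize) {
            int end = Math.min(start + batchSize, input.length);
            // Stand-in for "copy this batch to the device".
            float[] batch = Arrays.copyOfRange(input, start, end);
            // Stand-in for "run the kernel on the batch".
            for (int i = 0; i < batch.length; i++) {
                batch[i] = batch[i] * 2.0f;
            }
            // Stand-in for "copy the batch results back to the host".
            System.arraycopy(batch, 0, output, start, batch.length);
        }
        return output;
    }

    public static void main(String[] args) {
        float[] input = new float[10];
        for (int i = 0; i < input.length; i++) input[i] = i;
        float[] output = processInBatches(input, 4);
        System.out.println(numberOfBatches(input.length, 4)); // 3
        System.out.println(output[9]);                        // 18.0
    }
}
```

The last batch is usually smaller than the others, which is why the end index is clamped to the input length.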
Multi-device Execution (experimental)
Currently, task graphs are executed on a single accelerator (e.g., a GPU). Researchers have been experimenting with multi-device support in TornadoVM to automatically use multiple GPUs per task graph and increase performance. This feature is still experimental and not yet available in the public repository, but the code maintainers are working on it.
Integration with Big Data Platforms
Since TornadoVM is a framework for Java, it can be used by other Java projects. For example, TornadoVM has been integrated into Apache Flink for processing big data workloads and streaming applications on GPUs and FPGAs in a transparent manner. In this type of work, developers write the usual Apache Flink code, without any modifications to their programs, and the system automatically offloads compute-intensive workloads to GPUs and FPGAs.
Who is TornadoVM good for?
Is GPU/FPGA computing right for you?
As I mentioned at the beginning, hardware accelerators were designed to handle specific types of applications by specializing the computer architecture. Thus, if your workloads match the type of tasks that the accelerator was designed for, your programs will run faster and more efficiently on that hardware.
GPUs were designed for high-throughput applications, with a heavy focus on computer graphics and 2D matrices. Thus, applications in this category will benefit from GPU hardware. Some domains that require GPU compute are ML and DL, Big Data analytics, Natural Language Processing, Fintech, Computer Vision, Ray Tracing, Physics and Math Simulations, and Astrophysics, just to name a few.
Is TornadoVM right for you?
To be able to benefit from TornadoVM, we must have Java workloads that are suitable for acceleration, such as the Blur-Filter that we coded in the previous example. Furthermore, TornadoVM might be right for you if you are looking for a hardware-agnostic way of programming heterogeneous hardware.
Programming models such as OpenCL or CUDA are great for achieving performance, but at the cost of programmability. Besides, since performance tuning is also part of the development process with these programming frameworks, developers must be willing to adapt the code to new hardware. With TornadoVM, this is automatically handled by the TornadoVM runtime and JIT compiler.
Besides, TornadoVM offers features such as transparent task-migration across devices, batch processing, and profiling, to name a few, that can be very appealing for developers.
Note that TornadoVM is not a substitute for the usual Java execution and Java compilers (e.g., C1/C2, Graal), but rather a complement to achieve higher performance for specific types of applications. Thus, if you are running suitable workloads for GPU/FPGA in Java, and the TornadoVM features meet your needs, then TornadoVM can be considered as an option for hardware acceleration of your programs.
Conclusion
Heterogeneous hardware is present in almost every computing system as a means of increasing performance and processing applications more efficiently for specialized workloads. However, we also need a way to program these systems efficiently. In this article, we have explored TornadoVM, a parallel programming framework for accelerating Java applications on heterogeneous hardware. This article has covered the basics of TornadoVM and shown an example of how a computational photography filter implemented in Java can be accelerated on commodity GPUs.
This article has also briefly explained the key features of TornadoVM and discussed whether GPU and FPGA acceleration is a good target for your applications. This article merely scratches the surface of the TornadoVM framework, but I hope it gives developers a better understanding of when and why to use such frameworks.